{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# LlamaIndex Bottoms-Up Development - Documents and Nodes\n",
    "In order to answer questions about the LlamaIndex docs, we first need to load them!\n",
    "\n",
    "A majority of our documentation is in markdown format. For the sake of scope, we will ONLY worry about markdown files for now.\n",
    "\n",
    "When parsing these files, there are a few things we might want to keep track of\n",
    "\n",
    "- Current header (and header hierarchy!)\n",
    "- Code blocks\n",
    "- Text\n",
    "- Source file names\n",
    "\n",
    "While LlamaIndex does have a built-in markdown loader, we can write our own to fit our requirements exactly! Loaders are not magic -- they just read files and create documents. So building our own is easy!\n",
    "\n",
    "We have provided an implementation of a custom markdown loaded in the source code. Let's test it out to see how it works!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import sys\n",
    "sys.path.append(os.path.join(os.getcwd(), '..'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_docs_bot.markdown_docs_reader import MarkdownDocsReader\n",
    "from llama_index import SimpleDirectoryReader\n",
    "\n",
    "def load_markdown_docs(filepath):\n",
    "    \"\"\"Load markdown docs from a directory, excluding all other file types.\"\"\"\n",
    "    loader = SimpleDirectoryReader(\n",
    "        input_dir=filepath, \n",
    "        exclude=[\"*.rst\", \"*.ipynb\", \"*.py\", \"*.bat\", \"*.txt\", \"*.png\", \"*.jpg\", \"*.jpeg\", \"*.csv\", \"*.html\", \"*.js\", \"*.css\", \"*.pdf\", \"*.json\"],\n",
    "        file_extractor={\".md\": MarkdownDocsReader()},\n",
    "        recursive=True\n",
    "    )\n",
    "\n",
    "    return loader.load_data()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load our documents from each folder.\n",
    "# we keep them seperate for now, in order to create seperate indexes later\n",
    "getting_started_docs = load_markdown_docs(\"../docs/getting_started\")\n",
    "community_docs = load_markdown_docs(\"../docs/community\")\n",
    "data_docs = load_markdown_docs(\"../docs/core_modules/data_modules\")\n",
    "agent_docs = load_markdown_docs(\"../docs/core_modules/agent_modules\")\n",
    "model_docs = load_markdown_docs(\"../docs/core_modules/model_modules\")\n",
    "query_docs = load_markdown_docs(\"../docs/core_modules/query_modules\")\n",
    "supporting_docs = load_markdown_docs(\"../docs/core_modules/supporting_modules\")\n",
    "tutorials_docs = load_markdown_docs(\"../docs/end_to_end_tutorials\")\n",
    "contributing_docs = load_markdown_docs(\"../docs/development\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Make our printing look nice\n",
    "from llama_index.schema import MetadataMode"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "File Name: ../docs/core_modules/agent_modules/agents/root.md\n",
      "Content Type: text\n",
      "Header Path: Data Agents/Concept/Tool Abstractions\n",
      "\n",
      "You can learn more about our Tool abstractions in our Tools section.\n"
     ]
    }
   ],
   "source": [
    "print(agent_docs[5].get_content(metadata_mode=MetadataMode.ALL))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'File Name': '../docs/core_modules/agent_modules/agents/modules.md', 'Content Type': 'text', 'Header Path': 'Module Guides'}\n"
     ]
    }
   ],
   "source": [
    "print(agent_docs[0].metadata)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looks not bad! We can see that we have metadata, as well as nicely formatted content.\n",
    "\n",
    "But, we can improve the formatting even further! We can provide better templating, so that the LLM and embedding models can get a better idea of what they are reading."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "text_template = \"Content Metadata:\\n{metadata_str}\\n\\nContent:\\n{content}\"\n",
    "\n",
    "metadata_template = \"{key}: {value},\"\n",
    "metadata_seperator= \" \"\n",
    "\n",
    "for doc in agent_docs:\n",
    "    doc.text_template = text_template\n",
    "    doc.metadata_template = metadata_template\n",
    "    doc.metadata_seperator = metadata_seperator"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Content Metadata:\n",
      "File Name: ../docs/core_modules/agent_modules/agents/modules.md, Content Type: text, Header Path: Module Guides,\n",
      "\n",
      "Content:\n",
      "These guide provide an overview of how to use our agent classes.\n",
      "\n",
      "For more detailed guides on how to use specific tools, check out our tools module guides.\n"
     ]
    }
   ],
   "source": [
    "print(agent_docs[0].get_content(metadata_mode=MetadataMode.ALL))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Advanced Customization\n",
    "Going even further with metadata, we can also customize which metadata fields will be seen by both the embedding model and LLM."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Content Metadata:\n",
      "Content Type: text, Header Path: Module Guides,\n",
      "\n",
      "Content:\n",
      "These guide provide an overview of how to use our agent classes.\n",
      "\n",
      "For more detailed guides on how to use specific tools, check out our tools module guides.\n"
     ]
    }
   ],
   "source": [
    "# Hide the File Name from the LLM\n",
    "agent_docs[0].excluded_llm_metadata_keys = [\"File Name\"]\n",
    "print(agent_docs[0].get_content(metadata_mode=MetadataMode.LLM))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Content Metadata:\n",
      "Content Type: text, Header Path: Module Guides,\n",
      "\n",
      "Content:\n",
      "These guide provide an overview of how to use our agent classes.\n",
      "\n",
      "For more detailed guides on how to use specific tools, check out our tools module guides.\n"
     ]
    }
   ],
   "source": [
    "# Hide the File Name from the embedding model\n",
    "agent_docs[0].excluded_embed_metadata_keys = [\"File Name\"]\n",
    "print(agent_docs[0].get_content(metadata_mode=MetadataMode.EMBED))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Conclusion\n",
    "In this notebook, we covered how to use a custom data loader, as well as how to customize the text representations of your data when including metadata for both LLMs and embedding models."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.6"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
